TUKE-BNews-SK: Slovak Broadcast News Corpus Construction and Evaluation
نویسندگان
چکیده
This article presents an overview of the existing acoustical corpuses suitable for broadcast news automatic transcription task in the Slovak language. The TUKE-BNews-SK database created in our department was built to support the application development for automatic broadcast news processing and spontaneous speech recognition of the Slovak language. The audio corpus is composed of 479 Slovak TV broadcast news shows from public Slovak television called STV1 or “Jednotka” containing 265 hours of material and 186 hours of clean transcribed speech (4 hours subset extracted for testing purposes). The recordings were manually transcribed using Transcriber tool modified for Slovak annotators and automatic Slovak spell checking. The corpus design, acquisition, annotation scheme and pronunciation transcription is described together with corpus statistics and tools used. Finally the evaluation procedure using automatic speech recognition is presented on the broadcast news and parliamentary speeches test sets.
منابع مشابه
An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation
In this paper, we introduce an extension of our previously released TUKE-BNews-SK corpus based on a semi-automatic annotation scheme. It firstly relies on the automatic transcription of the BN data performed by our Slovak large vocabulary continuous speech recognition system. The generated hypotheses are then manually corrected and completed by trained human annotators. The corpus is composed o...
متن کاملThai Broadcast News Corpus Construction and Evaluation
Large speech and text corpora are crucial to the development of a state-of-the-art speech recognition system. This paper reports on the construction and evaluation of the first Thai broadcast news speech and text corpora. Specifications and conventions used in the transcription process are described in the paper. The speech corpus contains about 17 hours of speech data while the text corpus was...
متن کاملMatbn 2002: a Mandarin Chinese Broadcast News Corpus
The MATBN 2002 Mandarin Chinese broadcast news corpus contains a total of 40 hours of broadcast news from Public Television Service Foundation (Taiwan) with corresponding transcripts. The primary motivation for this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast domain. We expect to collect and process 220 hours of Mandarin Chine...
متن کاملMATBN: A Mandarin Chinese Broadcast News Corpus
The MATBN Mandarin Chinese broadcast news corpus contains a total of 198 hours of broadcast news from the Public Television Service Foundation (Taiwan) with corresponding transcripts. The primary purpose of this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast news domain. In this paper, we briefly introduce the speech corpus and r...
متن کاملLanguage-model Investigations Related to Broadcast News
In this paper we present some experiments that have been performed while developing language models for the PHILIPS Broadcast News system. Three main issues will be discussed: construction of phrases, adaptation of remote corpora to this task, and the combination of the diierent models. Also, per-plexities on the 1997 evaluation data are reported.
متن کامل